Probabilistic inference in machine learning using a quantum oracle

ABSTRACT

Methods, systems, and apparatus, including computer programs encoded on computer storage media, for using a quantum oracle to make inferences in complex machine learning models that are capable of solving artificial intelligence problems. Input to the quantum oracle is derived from the training data and the model parameters, and maps at least part of the interactions of interconnected units of the model to the interactions of qubits in the quantum oracle. The output of the quantum oracle is used to determine values used to compute loss function values or loss function gradient values or both during a training process.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims the benefit under 35 U.S.C. § 119(e) of U.S. Provisional Patent Application No. 61/876,744, titled “Probabilistic Inference in Machine Learning Using a Quantum Oracle”, filed on Sep. 11, 2013. The disclosure of the foregoing application is incorporated herein by reference in its entirety for all purposes.

BACKGROUND

This specification relates to using a quantum oracle in probabilistic inference in machine learning of models, including Boltzmann machines and undirected graphical models.

SUMMARY

The core of artificial intelligence and machine learning problems is to train a parametric model, e.g., a Boltzmann machine or an undirected graphical model, based on the observed training data, which may include labels and input features. The training process can benefit from using a quantum oracle for probabilistic inference. Input to the quantum oracle during training is derived from the training data and the model parameters, and maps at least part of the interactions of interconnected units of the model to the interactions of qubits in the quantum oracle. The output of the quantum oracle is used to determine values used to compute loss function values or loss function gradient values or both during a training process. The quantum oracle may also be used for inference during the prediction stage, i.e., at the performance stage, during which loss functions of different candidate output labels are compared.

The details of one or more embodiments of the subject matter of this specification are set forth in the accompanying drawings and the description below. Other features, aspects, and advantages of the subject matter will be apparent from the description, the drawings, and the claims.

DESCRIPTION OF DRAWINGS

FIG. 1 is a schematic diagram of an example unrestricted Boltzmann machine.

FIG. 2 is a schematic perspective view of flux qubits in a quantum machine.

FIG. 3 is a Chimera graph showing interacting qubits on an example chip of a quantum machine.

FIG. 4A is a schematic plot of two example Hamiltonians.

FIG. 4B is a flow diagram showing an example process of defining and annealing a Hamiltonian.

FIG. 5 is a schematic diagram of an example restricted Boltzmann machine.

FIG. 6 is a schematic diagram of an example tempered interacting Boltzmann machine.

FIG. 7 is a schematic diagram of an example undirected graphical model.

FIG. 8 is a schematic diagram of an example undirected graphical model with one or more latent variables.

FIG. 9 is a flow diagram of an example process for learning an inference in a tempered interacting Boltzmann machine.

FIG. 10 is a flow diagram of an example process for learning an inference in an undirected graphical model with latent variables.

DETAILED DESCRIPTION

Overview

Computer models can be trained to solve difficult and interesting problems, e.g., a wide range of machine learning problems such as classification problems, pattern matching problems, image recognition problems, speech recognition problems, voice recognition problems, or object recognition problems.

Useful models include Boltzmann machines, undirected graphical models, or stochastic recurrent neural networks. During training, model parameters are determined with the goal that the trained model optimally fits the labeled observed data from any of the problems to be solved. Part of or the entire training of the model can be computationally intractable, depending on the size or complexity of the model, or both. For example, the time required to collect equilibrium statistics for probabilistic inference can grow exponentially with the size (e.g., number of parameters or units in a network) of the model.

This specification will describe how a quantum oracle can be used to train a model, in particular when probabilistic inference of the model or part of the model is computationally intractable. A quantum oracle can be implemented using a quantum machine as will be described. An example quantum machine is any of the D-Wave superconducting adiabatic quantum computing (AQC) systems (available from D-Wave Systems Inc., British Columbia, Canada). Other alternative quantum information processors can also be used as quantum oracles. Some of these processors can perform universal AQC without being limited by 2-local stochastic Hamiltonians and can perform arbitrary or universal quantum operations in the adiabatic regime. Other quantum processors can perform computations based on a circuit model of computation with discrete logical gates including, e.g., trapped ion systems, neutral atoms in optical lattices, integrated photonic circuits, superconducting phase or charge qubit architectures, nitrogen-vacancy in diamonds, solid-state NMR (nuclear magnetic resonance) systems, and quantum dot structures.

Generally, input to the quantum machine is derived from the model to be trained and the training data, and the quantum machine outputs the desired equilibrium statistics for the trained model. The derivation of the input takes into consideration the structure of the model and the physical structure of the quantum machine. For the D-Wave system, the input has a QUBO (Quadratic Unconstrained Binary Optimization) format. In other words, during training, one or more computational tasks of a model are mapped into a QUBO problem for the quantum machine to solve. For example, when an undirected graphical model has the same structure as the hardware connectivity of the quantum machine (e.g., the D-Wave system), exact inference is made. In other implementations, e.g., when the model is a densely connected graph such as a Boltzmann machine, the quantum machine (e.g., the D-Wave system) is applied in part of the inference to improve the power of the model by adding more interactions between variables than are provided by traditional methods.

In some cases, the quantum oracle is used in minimizing a loss function of the model to be trained.

Adiabatic Quantum Computing (AQC) Processors

As shown in FIG. 2, in a conventional D-Wave system, qubits are formed with superconducting Niobium loops 200. As shown in FIG. 3, the D-Wave system contains a chip 206 that includes 8 by 8 unit cells 202 of eight qubits 204, connected by programmable inductive couplers as shown by lines connecting different qubits. Generally, a chip of qubits will be made with a number of qubits that is a power of 2, e.g., 128 or 512 or more qubits. The qubits and couplers between different qubits can be viewed as the vertices and edges, respectively, of a Chimera graph with a bipartite structure.

The chip 206 is a programmable quantum annealing chip. At the end of the quantum annealing schedule, the initial transverse Hamiltonian vanishes and the global energy becomes an eigenvalue of the problem Hamiltonian, which can be expressed as:

${E\left( {s_{1},\ldots\mspace{14mu},s_{N}} \right)} = {{\sum\limits_{i = 1}^{N}{h_{i}s_{i}}} + {\sum\limits_{{i < j} = 1}^{N}{J_{ij}s_{i}s_{j}}}}$

where s_(i) represents the ith qubit and is binary: s_(i)∈{−1,+1};

N is the total number of qubits in use; and h_(i), J_(ij)∈ℝ.

Different problems to be solved can be defined by the different input, real parameters h_(i) and J_(ij), which can be derived from training data. The sparsity of the parameter J_(ij) is constrained by the hardware connectivity (i.e., the connectivity of the qubits shown in FIG. 3). For disconnected qubits, the corresponding J_(ij) is 0.
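As an illustration only, the global energy above can be evaluated on a classical computer for a very small number of qubits. The following minimal sketch assumes that J is stored as a dictionary holding only the coupled pairs (i, j) with i < j, so that disconnected qubits implicitly have J_(ij)=0. The function names `ising_energy` and `brute_force_ground_state` and the example values of h and J are hypothetical and are not part of any quantum machine interface. The exhaustive search is a classical stand-in that visits all 2^N states and is therefore practical only for small N.

```python
from itertools import product


def ising_energy(s, h, J):
    """E(s) = sum_i h_i s_i + sum_{i<j} J_ij s_i s_j, with each s_i in {-1, +1}."""
    linear = sum(h[i] * s[i] for i in range(len(s)))
    # J holds only coupled pairs (i, j) with i < j; absent pairs mean J_ij = 0.
    quadratic = sum(J_ij * s[i] * s[j] for (i, j), J_ij in J.items())
    return linear + quadratic


def brute_force_ground_state(h, J):
    """Enumerate all 2^N states and return (lowest energy, corresponding state)."""
    states = product((-1, 1), repeat=len(h))
    return min((ising_energy(s, h, J), s) for s in states)


# Hypothetical three-qubit example with two couplers.
h = [0.5, -1.0, 0.25]
J = {(0, 1): 1.0, (1, 2): -0.5}
energy, state = brute_force_ground_state(h, J)
print(energy, state)
```

The sparsity constraint described above appears here only as the set of keys present in `J`; a hardware graph would restrict which keys are allowed.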

In searching for a ground state for the global energy defined by a particular problem, adiabatic quantum annealing is applied. By defining a Hamiltonian for a problem to be solved and inputting the parameters of the Hamiltonian, a machine learning training system can use the chip to perform quantum annealing to reach a global ground state, at which point the chip outputs the state of each qubit 204, e.g., in the form of a bit string:

$s^{*} = {{\underset{s}{argmin}{E_{Ising}(s)}} = {\underset{s}{argmax}\left\{ {{\sum\limits_{i,j}{s_{i}J_{ij}s_{j}}} + {\sum\limits_{i}{h_{i}s_{i}}}} \right\}}}$

An example conventional process 400 using quantum annealing is shown in FIGS. 4A and 4B. First, a Hamiltonian is defined (402), e.g., as H(t)=(1−t/T)H_(B)+(t/T)H_(P), where H_(B) is an initial Hamiltonian with a known and easily preparable ground state, and H_(P) is the problem Hamiltonian whose ground state encodes the solution to a given instance of an optimization problem, e.g., mapped from the model to be developed. Then, the Hamiltonian is annealed (404) to reach a ground state that provides the solution to the optimization problem. During the annealing process, quantum dynamical evolution adds alternative physical paths to escape local minima of the global energy due to quantum tunneling. The quantum oracle relies on the possibility of multi-particle correlated quantum tunneling through local energy barriers, even when the system does not have enough thermal energy to jump over the energy barriers in a classical process employed in alternative algorithms, such as simulated annealing.

The Adiabatic Algorithms and Their Implementations in Training Different Models

Boltzmann Machine

The Boltzmann machine is an example of a model for which a quantum oracle can be used in training the model or part of the model. The following description is generally applicable to other exponential models. A trained Boltzmann machine can be used for solving a wide range of machine learning problems such as pattern matching problems, image recognition problems, speech recognition problems, voice recognition problems, and object recognition problems.

FIG. 1 is a graphical representation of an example conventional Boltzmann machine 100. It is represented as a network of units 102, 104, 106, 108, 110, 112, 114. Each unit is connected with at least one other unit through a connection 120. Sometimes not all units are visible units. For example, units 102, 104, 106 can be invisible or hidden units. Each unit is defined with an energy and is stochastic. Each connection 120 can be a parameter that represents the connection strength between the two units. In the example shown in the figure, the hidden units have interactions with each other and the visible units have interactions among themselves. Such a Boltzmann machine is also called an unrestricted Boltzmann machine.

A Boltzmann machine is trained in an iterative way: in each iteration, samples are generated from the equilibrium distribution of the machine with its current parameters, and the parameters are then updated based on the discrepancy between these samples and the observed training data.

FIG. 5 shows a graphical representation of an example restricted Boltzmann machine 500 with a reduced level of complication as compared to the Boltzmann model 100 of FIG. 1. The machine 500 contains a layer 504 of invisible/hidden units 508, 510, 512, and two layers 502, 506 of visible units 514, 516, 518; 520, 522, 524. The variables in the visible layers 502, 506 are denoted as x, y, respectively. In building the model, the variables x are the input, which will receive observed data, after the model is trained, and the variables y are the output. The variables in the hidden layer are labeled as s. The units within the same group do not interact with each other, and the units within different groups interact with each other.

The energy function E(x, y, s) of the restricted Boltzmann machine is:

${{- E} = {{\sum\limits_{i}{\sum\limits_{j}{x_{i}w_{ij}s_{j}}}} + {\sum\limits_{n}{\sum\limits_{j}{y_{n}v_{nj}s_{j}}}} + {\sum\limits_{i}{a_{i}x_{i}}} + {\sum\limits_{j}{b_{j}s_{j}}} + {\sum\limits_{n}{c_{n}y_{n}}}}},$ and its probability distribution is:

${p\left( {x,y,s} \right)} = {\frac{1}{Z}{{\exp\left( {- {E\left( {x,y,s} \right)}} \right)}.}}$

In the above equations, i indexes the visible units in the layer 506; n indexes the visible units in the layer 502; and j indexes the hidden units in the layer 504. The term w_(ij) represents the interactions/connections between a hidden unit and a visible unit in the layer 506; and v_(nj) represents the interactions/connections between a hidden unit and a visible unit in the layer 502. The terms a_(i), b_(j), and c_(n) are weights for each unit.

Given a number of training samples (x̂, ŷ), the loss function of the restricted Boltzmann machine can be defined as:

$\begin{matrix} {L = {{- \log}{\sum\limits_{s}{p\left( {\hat{x},s,\hat{y}} \right)}}}} \\ {= {{{- \log}{\sum\limits_{s}{\exp\left( {- {E\left( {\hat{x},\hat{y},s} \right)}} \right)}}} + {\log{\sum\limits_{x,y,s}{\exp\left( {- {E\left( {x,y,s} \right)}} \right)}}}}} \end{matrix}$

In order to minimize the loss function, each parameter w_(ij) is selected based on a stochastic gradient, for example: w_(ij)←w_(ij)−γ∂L/∂w_(ij).

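The gradient of the loss function over w_(ij) is: $\frac{\partial L}{\partial w_{ij}} = {{- {\hat{x}}_{i}}s_{j}^{0}} + {x_{i}^{1}s_{j}^{1}}$ where

$s_{j}^{0} = {\int{{p\left( {s_{j}|\hat{x},\hat{y}} \right)}s_{j}\,{ds}_{j}}},\qquad {x_{i}^{1}s_{j}^{1}} = {\int{{p\left( {x_{i},s_{j}} \right)}x_{i}s_{j}\,{dx}_{i}\,{ds}_{j}}}$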

Generally, x_(i) ¹s_(j) ¹ is difficult to obtain because of the dense inter-layer connections. When the variables x, y, and s are independent of each other, the distribution becomes:

${p\left( {{s_{j} = \left. 1 \middle| x \right.},y} \right)} = {\left( {1 + {\exp\left( {{- {\sum\limits_{i}{2\; x_{i}w_{ij}}}} - {\sum\limits_{n}{2\; y_{n}v_{n\; j}}} - {2\; b_{j}}} \right)}} \right)^{- 1}.}$

In some implementations, x¹, y¹, and s¹ are approximated by the so-called contrastive divergence:

-   Sample s⁰ from p(s|x̂, ŷ);
-   Sample y¹ from p(y|s⁰);
-   Sample x¹ from p(x|s⁰);
-   Sample s¹ from p(s|x¹, y¹).
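The four sampling steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a definitive implementation: all units are assumed to take values in {−1,+1}; the conditional distributions for x and y are assumed to have the same logistic form as the distribution given above for s_(j), which follows from the symmetry of the energy function; and the function names, the small weight matrices, and the learning rate are hypothetical.

```python
import math
import random


def p_plus(field):
    """Probability that a +/-1 unit equals +1 given its local field."""
    return 1.0 / (1.0 + math.exp(-2.0 * field))


def sample_unit(field, rng):
    return 1 if rng.random() < p_plus(field) else -1


def sample_hidden(x, y, w, v, b, rng):
    """Sample s from p(s | x, y); the field on s_j is sum_i x_i w_ij + sum_n y_n v_nj + b_j."""
    s = []
    for j in range(len(b)):
        field = (sum(x[i] * w[i][j] for i in range(len(x)))
                 + sum(y[n] * v[n][j] for n in range(len(y))) + b[j])
        s.append(sample_unit(field, rng))
    return s


def sample_visible(s, weights, bias, rng):
    """Sample a visible layer given s; the field on unit k is sum_j weights[k][j] s_j + bias[k]."""
    return [sample_unit(sum(weights[k][j] * s[j] for j in range(len(s))) + bias[k], rng)
            for k in range(len(bias))]


def cd1(x_hat, y_hat, w, v, a, b, c, rng):
    """One round of the four sampling steps listed above."""
    s0 = sample_hidden(x_hat, y_hat, w, v, b, rng)
    y1 = sample_visible(s0, v, c, rng)
    x1 = sample_visible(s0, w, a, rng)
    s1 = sample_hidden(x1, y1, w, v, b, rng)
    return s0, x1, y1, s1


def update_w(w, x_hat, s0, x1, s1, gamma):
    """w_ij <- w_ij - gamma * (-x_hat_i s0_j + x1_i s1_j)."""
    return [[w[i][j] - gamma * (-x_hat[i] * s0[j] + x1[i] * s1[j])
             for j in range(len(s0))] for i in range(len(x_hat))]


rng = random.Random(0)
x_hat, y_hat = [1, -1, 1], [1, -1]
w = [[0.1, -0.2], [0.0, 0.3], [0.2, 0.1]]
v = [[0.1, 0.0], [-0.1, 0.2]]
a, b, c = [0.0, 0.0, 0.0], [-1.0, -1.0], [0.0, 0.0]
s0, x1, y1, s1 = cd1(x_hat, y_hat, w, v, a, b, c, rng)
w_new = update_w(w, x_hat, s0, x1, s1, gamma=0.1)
```

Because each sample is a single draw, the resulting update is a noisy estimate of the gradient; averaging over several training samples would reduce that noise.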

When a quantum machine is used to train a Boltzmann machine, the implementation may be able to restore at least some of the interactions among the hidden units. Such a modified restricted Boltzmann machine is called a tempered interacting Boltzmann machine (TIBM). The TIBM can better resemble a neural network and solve problems more precisely than the restricted Boltzmann machine. The TIBM is different from a restricted Boltzmann machine in at least two ways:

(1) Interactions are added among the hidden units s based on the hardware connections of the quantum machine, e.g., as illustrated in the Chimera graph of FIG. 3; and

(2) Because the quantum machine can be used as an oracle that performs maximizations, the sampling in the hidden layer is replaced by the maximization, which corresponds to temperature T=0.

The energy of the TIBM is:

${- E} = {{\sum\limits_{i}{\sum\limits_{j}{x_{i}w_{ij}s_{j}}}} + {\sum\limits_{n}{\sum\limits_{j}{y_{n}v_{nj}s_{j}}}} + {\sum\limits_{i}{a_{i}x_{i}}} + {\sum\limits_{j}{b_{j}s_{j}}} + {\sum\limits_{n}{c_{n}y_{n}}} + {\sum\limits_{i}{\sum\limits_{{({j,k})} \in E}{x_{i}{\overset{¨}{w}}_{ijk}s_{j}s_{k}}}} + {\sum\limits_{n}{\sum\limits_{{({j,k})} \in E}{y_{n}{\overset{¨}{v}}_{njk}s_{j}s_{k}}}} + {\sum\limits_{{({j,k})} \in E}{{\overset{¨}{b}}_{jk}s_{j}{s_{k}.}}}}$ and its probability distribution is:

${p\left( {x,y,s} \right)} = {\frac{1}{Z}{{\exp\left( {- {E\left( {x,y,s} \right)}} \right)}.}}$

A graphical representation of an example TIBM 600 is shown in FIG. 6. The machine 600 contains three layers, two visible layers 602, 606, and an invisible layer 604. Compared to the restricted Boltzmann machine 500 of FIG. 5, units 610, 612, 614 of the invisible layer 604 interact with each other. At temperature T, the loss function of the TIBM is:

$L_{T} = {{{- \log}\;{\overset{\sim}{p}\left( {\hat{x},\hat{y}} \right)}} = {{- {F_{T}\left( {\log\;{p\left( {\hat{x},s,\hat{y}} \right)}} \right)}} + {\log\;{\sum\limits_{x,y}{{\exp\left( {F_{T}\left( {\log\;{p\left( {x,s,y} \right)}} \right)} \right)}.}}}}}$ In the above equation:

${{F_{T}\left( {f(s)} \right)} = {T\;\log\;{\sum\limits_{s}{\exp\left( {{f(s)}/T} \right)}}}},{{\overset{\sim}{p}\left( {\hat{x},\hat{y}} \right)} = {\frac{1}{Z}{\exp\left( {F_{T}\left( {\log\;{p\left( {\hat{x},s,\hat{y}} \right)}} \right)} \right)}}},$ and Z=Σ _(x, y)exp(F _(T)(log p(x, s, y))).

When T is 0, F_(T)(f(s))→max_(s) f(s), so that

$\begin{matrix} {L_{0} = {{{- \log}\;{{\overset{\sim}{p}}_{0}\left( {\hat{x},\hat{y}} \right)}} = {{- \log}\;\frac{\max_{s}{p\left( {\hat{x},s,\hat{y}} \right)}}{\sum\limits_{x,y}{\max_{s^{\prime}}{p\left( {x,s^{\prime},y} \right)}}}}}} \\ {= {{{- \log}\;{\max\limits_{s}{p\left( {\hat{x},s,\hat{y}} \right)}}} + {\log\;{\sum\limits_{x,y}{\max\limits_{s^{\prime}}{{p\left( {x,s^{\prime},y} \right)}.}}}}}} \end{matrix}$
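The behavior of the tempered function F_(T) as T decreases can be checked numerically. The sketch below is illustrative only: it evaluates F_(T) over an explicit list of values f(s), subtracts the maximum before exponentiating for numerical stability, and treats T=0 as the limiting case max_(s) f(s). The function name `tempered` and the example values are hypothetical.

```python
import math


def tempered(f_values, T):
    """F_T(f) = T * log(sum_s exp(f(s) / T)); T = 0 is taken as the limit max_s f(s)."""
    m = max(f_values)
    if T == 0:
        return m
    # Subtracting the maximum keeps every exponent non-positive.
    return m + T * math.log(sum(math.exp((f - m) / T) for f in f_values))


f_values = [-1.0, 0.5, 2.0]
for T in (1.0, 0.1, 0.01, 0):
    print(T, tempered(f_values, T))
```

At T=1 the function reduces to the ordinary log-sum-exp, and as T shrinks the printed values approach the maximum of the list from above.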

The resulting contrastive divergence algorithm becomes:

-   Compute s⁰=argmax p(s|x̂, ŷ);
-   Sample y¹ from p(y|s⁰);
-   Sample x¹ from p(x|s⁰);
-   Compute s¹=argmax p(s|x¹, y¹).

To perform contrastive divergence:

-   (1) x and y are sampled in the same way as in the algorithm for a restricted Boltzmann machine; and
-   (2) s is computed using the quantum oracle, or using approximate inference methods such as mean-field methods.

FIG. 9 shows an example process 904 performed by a digital computer 900 or system of multiple digital computers and a quantum oracle 902 to learn the inference in a tempered interacting Boltzmann machine (TIBM) and train the TIBM. The classical computer 900 stores a model parameter set Q={w_(ij), v_(nj), a_(i), b_(j), c_(n), w_(ijk), v_(njk), b_(jk)} and training data x̂, ŷ. The quantum oracle 902 receives input from the classical computer 900 to do probabilistic inference and outputs model statistics to allow the classical computer 900 to complete training the TIBM.

In particular, the computer 900 initializes (906) the parameters in the set Q. Using the initialized parameter set Q and the training data x̂, ŷ, the computer derives and outputs (908) h₀ and J₀ for log p(s|x̂, ŷ) to the quantum oracle 902. The quantum oracle 902 then performs adiabatic annealing and outputs (910) s⁰ to the classical computer 900. The classical computer samples (912) x¹ and y¹ based on s⁰ and Q, and calculates and outputs (912) h₁ and J₁ for log p(s|x¹, y¹) to the quantum oracle 902. The quantum oracle again performs adiabatic annealing and outputs (914) s¹ to the classical computer 900. Based on s⁰, s¹, x¹, y¹, x̂, ŷ, and Q, the classical computer computes the new set of parameters Q. If the computed new Q has converged, the process of training the TIBM is completed. Otherwise, the new set of parameters replaces the previous Q and the training process continues with the step 908 until the parameters converge.
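The derivation of h and J in steps 908 and 912 can be illustrated by collecting the terms of the TIBM energy that depend on s. For fixed x and y, log p(s|x, y) equals Σ_(j) h_(j)s_(j) + Σ_((j,k)∈E) J_(jk)s_(j)s_(k) plus a constant. The sketch below is a hedged illustration: the names `w2`, `v2`, and `b2` are hypothetical stand-ins for the three-index and pairwise parameters of the TIBM energy, the exhaustive `argmax_s` is a classical stand-in for the quantum oracle that is usable only for a handful of hidden units, and any sign or scaling convention required by particular hardware is not addressed.

```python
from itertools import product


def derive_h_J(x, y, w, v, b, w2, v2, b2, edges):
    """Coefficients of log p(s | x, y) = sum_j h_j s_j + sum_(j,k) J_jk s_j s_k + const."""
    h = [sum(x[i] * w[i][j] for i in range(len(x)))
         + sum(y[n] * v[n][j] for n in range(len(y))) + b[j]
         for j in range(len(b))]
    J = {(j, k): sum(x[i] * w2[i][(j, k)] for i in range(len(x)))
                 + sum(y[n] * v2[n][(j, k)] for n in range(len(y))) + b2[(j, k)]
         for (j, k) in edges}
    return h, J


def argmax_s(h, J):
    """Classical stand-in for the oracle: exhaustive search over s in {-1, +1}^len(h)."""
    def score(s):
        return (sum(h[j] * s[j] for j in range(len(h)))
                + sum(J_jk * s[j] * s[k] for (j, k), J_jk in J.items()))
    return max(product((-1, 1), repeat=len(h)), key=score)


# Hypothetical model: two x units, one y unit, two hidden units joined by one edge.
x, y = [1, -1], [1]
w = [[0.5, -0.5], [0.25, 0.25]]
v = [[1.0, 0.0]]
b = [0.0, 0.0]
edges = [(0, 1)]
w2 = [{(0, 1): 0.0}, {(0, 1): 0.0}]
v2 = [{(0, 1): 0.0}]
b2 = {(0, 1): -2.0}
h0, J0 = derive_h_J(x, y, w, v, b, w2, v2, b2, edges)
s0 = argmax_s(h0, J0)
print(h0, J0, s0)
```

The same two functions would be called a second time with x¹ and y¹ in place of the training data to obtain h₁, J₁, and s¹.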

Engineering Refinements for TIBM

In using the quantum oracle to evaluate the loss functions of different models for training the models, various engineering refinements can improve the performance of the adiabatic algorithms. Some of these are related to the following aspects or parameters:

Label margin: a margin D(y, ŷ) is added, which represents the scaled Hamming distance between y and ŷ, such that

${p\left( {x,s,y} \right)} = {\frac{1}{Z}{{\exp\left( {{E\left( {x,s,y} \right)} + {D\left( {y,\hat{y}} \right)}} \right)}.}}$

Weight regularization: an L₂ regularization of the weights, with a regularization constant, is added to the loss function.

Centering trick: the input feature x is centered by x←x−α, where α is the mean of x.

Initialization: the following can be initialized, where “all others” refers to parameters v_(nj), c_(n), w_(ijk), v_(njk), and b_(jk):

-   w_(ij)˜0.1·N(0, 1);
-   a_(i)=log(α_(i)/(1−α_(i)));
-   b_(j)=−1;
-   all others are set to be 0.

I-BM (interacting Boltzmann machine) pretraining:

-   (1) Pre-train the corresponding TRBM (tempered restricted Boltzmann machine) with all edges in the hidden layer being locked to 0;
-   (2) Train the whole TIBM with unlocked edge parameters after TRBM pretraining.

Averaged stochastic gradient descent:

-   (1) The learning rate reduces with iterations k: γ=η(1+λkη)^(−0.75).
-   (2) The averaged parameters are used as output:

$\left. \overset{\_}{w}\leftarrow{{\frac{k}{k + 1}\overset{\_}{w}} + {\frac{1}{k + 1}{w.}}} \right.$
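The two averaged stochastic gradient descent rules above can be written directly as code. The following minimal sketch assumes a scalar parameter for brevity and uses hypothetical function names and example values; it shows that the running update reproduces the arithmetic mean of the parameter values seen so far.

```python
def learning_rate(eta, lam, k):
    """gamma = eta * (1 + lam * k * eta) ** -0.75, which decreases as k grows."""
    return eta * (1.0 + lam * k * eta) ** -0.75


def update_average(w_bar, w, k):
    """w_bar <- k / (k + 1) * w_bar + 1 / (k + 1) * w."""
    return (k / (k + 1)) * w_bar + (1 / (k + 1)) * w


# Hypothetical sequence of parameter values produced by successive iterations.
values = [1.0, 2.0, 3.0, 6.0]
w_bar = 0.0
for k, w in enumerate(values):
    w_bar = update_average(w_bar, w, k)
print(w_bar, [learning_rate(0.1, 0.01, k) for k in range(3)])
```

With k starting at 0, the first update discards the initial value of the average entirely, so no special initialization of the average is needed.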

Compatible ASGD (Averaged Stochastic Gradient Descent algorithm) with pretraining:

-   (1) Reinitialize k=0 after pretraining.
-   (2) Keep the current learning rate η←γ.

Undirected Graphical Models

As another example of using a quantum machine in a system for or process of training a model, e.g., an undirected graphical model or an exponential model, a three-unit model 700 is shown in FIG. 7. The 3-unit model 700 contains three labels z={z₁, z₂, z₃} as random binary variables. The joint distribution of {z₁, z₂, z₃} is:

$\begin{matrix} {{p\left( {z_{1},z_{2},z_{3}} \right)} = {\frac{1}{Z}{\exp\left( {{h_{1}z_{1}} + {h_{2}z_{2}} + {h_{3}z_{3}} + {J_{12}z_{1}z_{2}} + {J_{23}z_{2}z_{3}}} \right)}}} \\ {= {\frac{1}{Z}{\exp\left( {{\sum\limits_{i}{h_{i}z_{i}}} + {\sum\limits_{{({i,j})} \in E}{J_{ij}z_{i}z_{j}}}} \right)}}} \\ {{= {\frac{1}{Z}{\exp\left( {- {E_{Ising}(z)}} \right)}}},} \end{matrix}$ Where:

$Z = {\sum\limits_{z}{{\exp\left( {- {E_{Ising}(z)}} \right)}.}}$

Given training data ẑ, at least two methods can be used to train the model. The first method is to maximize the log-likelihood log p(ẑ), i.e., to minimize the loss function

${L\left( {h,{J;\hat{z}}} \right)} = {{E_{Ising}\left( \hat{z} \right)} + {\log{\sum\limits_{z}{{\exp\left( {- {E_{Ising}(z)}} \right)}.}}}}$

The term log Σ_(z) exp f(z) is the special case of the tempered function with T=1,

${F_{T}(f)} = {T\;\log{\sum\limits_{z}{{\exp\left( {{f(z)}/T} \right)}.}}}$

When T=0, ${F_{0}(f)} = {\max\limits_{z}{{f(z)}.}}$

This leads to the loss function:

${L\left( {h,{J;\hat{z}}} \right)} = {{E_{Ising}\left( \hat{z} \right)} + {\max\limits_{z}{\left( {- {E_{Ising}(z)}} \right).}}}$

In this 3-unit example, both log Σ_(z)exp(−E_(Ising)(z)) and max_(z)(−E_(Ising)(z)) are relatively simple. However, when the model is more complex and more variables are involved, the computational cost will grow exponentially with the number of variables. Accordingly, it is advantageous to compute the loss function L using a quantum oracle. For example, the D-Wave system as an oracle can be used to estimate the second term max_(z)(−E_(Ising)(z)). The quantum oracle can be used to evaluate the gradient of the loss function as well.
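For the 3-unit model, both loss functions can be computed by direct enumeration, which makes the relationship between them easy to inspect. The sketch below is illustrative only and assumes that each z_(i) takes values in {−1,+1}; the function names and the example parameters are hypothetical, and the enumeration over all 2^3 states is exactly the step that becomes intractable for larger models.

```python
import math
from itertools import product


def neg_energy(z, h, J):
    """-E_Ising(z) = sum_i h_i z_i + sum_(i,j) J_ij z_i z_j."""
    return (sum(h[i] * z[i] for i in range(len(z)))
            + sum(J_ij * z[i] * z[j] for (i, j), J_ij in J.items()))


def all_neg_energies(h, J):
    return [neg_energy(z, h, J) for z in product((-1, 1), repeat=len(h))]


def log_likelihood_loss(z_hat, h, J):
    """T = 1 loss: E_Ising(z_hat) + log sum_z exp(-E_Ising(z))."""
    vals = all_neg_energies(h, J)
    m = max(vals)
    log_Z = m + math.log(sum(math.exp(val - m) for val in vals))
    return -neg_energy(z_hat, h, J) + log_Z


def max_loss(z_hat, h, J):
    """T = 0 loss: E_Ising(z_hat) + max_z (-E_Ising(z))."""
    return -neg_energy(z_hat, h, J) + max(all_neg_energies(h, J))


# Hypothetical parameters for the chain z1 - z2 - z3.
h = [0.5, -0.25, 0.0]
J = {(0, 1): 1.0, (1, 2): -1.0}
print(max_loss((1, 1, -1), h, J), max_loss((-1, 1, 1), h, J))
print(log_likelihood_loss((1, 1, -1), h, J))
```

The T=0 loss is zero exactly when the training point is a maximizer of −E_(Ising), and the T=1 loss is never smaller than the T=0 loss for the same parameters.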

Undirected Graphical Models with One or More Latent Variables

FIG. 8 shows an undirected graphical model with latent variables 800, which can also be called a latent undirected graphical model. Compared to the model 700 of FIG. 7, a latent undirected graphical model includes one or more latent binary variables s; thus, the model 800 is capable of representing a rich family of distributions, in addition to first and second order interactions. The distribution function for the model 800 is:

$\begin{matrix} {{p\left( {z_{1},z_{2},z_{3},s} \right)} = {\frac{1}{Z}{\exp\left( {{h_{1}z_{1}} + {h_{2}z_{2}} + {h_{3}z_{3}} + {h_{s}s} + {J_{12}z_{1}z_{2}} +} \right.}}} \\ \left. {{J_{23}z_{2}z_{3}} + {J_{1\; s}z_{1}s} + {J_{2\; s}z_{2}s} + {J_{3\; s}z_{3}s}} \right) \\ {= {\frac{1}{Z}{\exp\left( {- {E_{Ising}\left( {z,s} \right)}} \right)}}} \end{matrix}$ Because s is not observed, there are at least two ways to define the loss function. The first way is to sum over s and define the loss function as −log Σ_(s) p(ẑ, s) such that

$L = {{- {\log\left( {\sum\limits_{s}{\exp\left( {- {E_{Ising}\left( {\hat{z},s} \right)}} \right)}} \right)}} + {{\log\left( {\sum\limits_{z,s}{\exp\left( {- {E_{Ising}\left( {z,s} \right)}} \right)}} \right)}.}}$

The other way is to maximize over s, and define the loss function as:

$L = {{- {\max\limits_{s}\left\{ {- {E_{Ising}\left( {\hat{z},s} \right)}} \right\}}} + {\max\limits_{z,s}{\left\{ {- {E_{Ising}\left( {z,s} \right)}} \right\}.}}}$

The quantum oracle can be readily applied to this loss function to solve the inference of the two max problems.

FIG. 10 shows an example process 1000 in which a digital computer 1002 or system of multiple digital computers and a quantum oracle 1004 learn the inference in an undirected graphical model with one or more latent variables and train the undirected graphical model. The classical computer 1002 stores model parameters h_(i) and J_(ij) and training data ẑ. The quantum oracle 1004 receives input from the classical computer 1002 to learn the inference and outputs data to allow the classical computer 1002 to complete training the model.

The process 1000 starts when the classical computer 1002 initializes (1006) the parameters h_(i) and J_(ij), and calculates and outputs (1008) h₁ and J₁ of E_(Ising)(z, s) to the quantum oracle 1004. In addition, the classical computer 1002 computes and outputs (1012) to the quantum oracle 1004 h₀ and J₀ of E_(Ising)(ẑ, s) based on the initialized parameters h_(i) and J_(ij) and the training data ẑ. Based on the received parameters h₁ and J₁, the quantum oracle 1004 computes and outputs (1010) (z¹, s¹)=argmin E_(Ising)(z, s); and based on the received parameters h₀ and J₀, the quantum oracle 1004 computes and outputs (1014) s⁰=argmin E_(Ising)(ẑ, s). The classical computer 1002 then calculates (1016) and outputs new parameters h_(i) and J_(ij) based on z¹, s¹, s⁰, and ẑ. If the computed new parameters h_(i) and J_(ij) have converged, the process of training the undirected graphical model is completed. Otherwise, the new parameters replace the previous parameters and the training process continues with the step 1008 until the parameters converge.
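The loop of process 1000 can be sketched end to end on a classical computer for a very small model. Several parts of this sketch are assumptions rather than details given above: the oracle is replaced by exhaustive search; all variables take values in {−1,+1}; the observed and latent variables are concatenated into one vector u=(z, s); the parameter update is a subgradient step on the maximization-based loss, h_(i)←h_(i)−γ(u¹_(i)−u⁰_(i)) and J_(ij)←J_(ij)−γ(u¹_(i)u¹_(j)−u⁰_(i)u⁰_(j)), with u¹=(z¹, s¹) and u⁰=(ẑ, s⁰); and convergence is declared when the two oracle answers coincide. The function names, the step size, and the edge list are hypothetical.

```python
from itertools import product


def neg_energy(u, h, J):
    """-E_Ising(u) = sum_i h_i u_i + sum_(i,j) J_ij u_i u_j."""
    return (sum(h[i] * u[i] for i in range(len(u)))
            + sum(J_ij * u[i] * u[j] for (i, j), J_ij in J.items()))


def oracle_argmin(h, J, n_free, clamped=()):
    """Stand-in for the quantum oracle: minimise E_Ising over the free trailing
    units by exhaustive search, with the leading units clamped to given values."""
    candidates = (tuple(clamped) + free for free in product((-1, 1), repeat=n_free))
    return max(candidates, key=lambda u: neg_energy(u, h, J))


def train(z_hat, n_latent, edges, gamma=0.1, max_iters=100):
    n = len(z_hat) + n_latent
    h = [0.0] * n                       # initialise parameters (1006)
    J = {edge: 0.0 for edge in edges}
    for _ in range(max_iters):
        u1 = oracle_argmin(h, J, n)                         # (z1, s1), step 1010
        u0 = oracle_argmin(h, J, n_latent, clamped=z_hat)   # (z_hat, s0), step 1014
        if u1 == u0:                    # zero subgradient: treated as converged
            break
        # Assumed subgradient update of the maximization-based loss (1016).
        h = [h[i] - gamma * (u1[i] - u0[i]) for i in range(n)]
        J = {(i, j): J_ij - gamma * (u1[i] * u1[j] - u0[i] * u0[j])
             for (i, j), J_ij in J.items()}
    return h, J


# Three observed units (indices 0-2) and one latent unit (index 3).
edges = [(0, 1), (1, 2), (0, 3), (1, 3), (2, 3)]
z_hat = (1, -1, 1)
h, J = train(z_hat, n_latent=1, edges=edges)
best = oracle_argmin(h, J, 4)
loss = neg_energy(best, h, J) - neg_energy(oracle_argmin(h, J, 1, clamped=z_hat), h, J)
print(best, loss)
```

With a single training point the loop stops as soon as the unconstrained minimizer agrees with the training data on the observed units; a realistic training set would require accumulating the update over many samples.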

Embodiments of the subject matter and the functional operations described in this specification can be implemented in digital electronic circuitry, in tangibly-embodied computer software or firmware, in computer hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them. Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible non-transitory storage medium for execution by, or to control the operation of, data processing apparatus. Alternatively or in addition, the program instructions can be encoded on an artificially-generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode information for transmission to suitable receiver apparatus for execution by a data processing apparatus. The computer storage medium can be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them.

The term “data processing apparatus” refers to data processing hardware and encompasses all kinds of apparatus, devices, and machines for processing data, including by way of example a programmable digital processor, a digital computer, or multiple digital processors or computers. The apparatus can also be or further include special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit). The apparatus can optionally include, in addition to hardware, code that creates an execution environment for computer programs, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them.

A computer program, which may also be referred to or described as a program, software, a software application, a module, a software module, a script, or code, can be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program may, but need not, correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data, e.g., one or more scripts stored in a markup language document, in a single file dedicated to the program in question, or in multiple coordinated files, e.g., files that store one or more modules, sub-programs, or portions of code. A computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a data communication network.

The processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform functions by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit). For a system of one or more computers to be “configured to” perform particular operations or actions means that the system has installed on it software, firmware, hardware, or a combination of them that in operation cause the system to perform the operations or actions. For one or more computer programs to be configured to perform particular operations or actions means that the one or more programs include instructions that, when executed by data processing apparatus, cause the apparatus to perform the operations or actions.

Computers suitable for the execution of a computer program can be based, by way of example, on general or special purpose microprocessors or both, or any other kind of central processing unit. Generally, a central processing unit will receive instructions and data from a read-only memory or a random access memory or both. The essential elements of a computer are a central processing unit for performing or executing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks. However, a computer need not have such devices. Moreover, a computer can be embedded in another device, e.g., a mobile telephone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device, e.g., a universal serial bus (USB) flash drive, to name just a few.

Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.

Control of the various systems described in this specification, or portions of them, can be implemented in a computer program product that includes instructions that are stored on one or more non-transitory machine-readable storage media, and that are executable on one or more processing devices. The systems described in this specification, or portions of them, can be implemented as an apparatus, method, or electronic system that may include one or more processing devices and memory to store executable instructions to perform the operations described in this specification.

While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or on the scope of what may be claimed, but rather as descriptions of features that may be specific to particular embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.

Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.

Particular embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. For example, the actions recited in the claims can be performed in a different order and still achieve desirable results. As one example, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous. In another example, the implementation of a quantum oracle in learning the inference and training models can be alternatively performed by systems that are different from the D-Wave systems or the AQC systems. For example, a universal quantum computer that relies on quantum phase estimation algorithms to find the ground state of the problem Hamiltonian can be used. Also, in addition to the models discussed above, there are many other supervised and unsupervised machine learning algorithms, for use in classifications, clustering, and inference tasks. A quantum information processor can be used in association with part or all of these algorithms during the training stage, the performance stage, or both stages of the algorithms. In another example, in addition to the training stage, a quantum oracle may also be used for inference during the prediction stage, i.e., at the performance stage, during which loss functions of different candidate output labels are compared. The best candidate output label can be selected based on the lowest loss function value. The systems and methods described in the specification can be used in connection with cloud computing, e.g., to provide training or predicting services to a cloud architecture.
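
The prediction-stage selection described above, in which loss functions of candidate output labels are compared and the label with the lowest value is chosen, can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the specification: the function name `select_label`, the dictionary `toy_losses`, and its numeric values are hypothetical, and the loss callable stands in for whatever loss evaluation a system provides, e.g., one computed using output of a quantum oracle.

```python
from typing import Callable, Hashable, Iterable, Tuple


def select_label(
    candidates: Iterable[Hashable],
    loss_fn: Callable[[Hashable], float],
) -> Tuple[Hashable, float]:
    """Return the candidate label with the lowest loss, and that loss.

    `loss_fn` is a hypothetical callable; in a real system it would wrap
    the loss evaluation for one candidate output label.
    """
    # Evaluate the loss once per candidate label.
    scored = [(loss_fn(label), label) for label in candidates]
    if not scored:
        raise ValueError("at least one candidate label is required")
    # Pick the pair with the smallest loss value (first one wins on ties).
    best_loss, best_label = min(scored, key=lambda pair: pair[0])
    return best_label, best_loss


# Hypothetical stand-in loss values for three candidate labels.
toy_losses = {"cat": 1.7, "dog": 0.4, "bird": 2.3}

best_label, best_loss = select_label(toy_losses, toy_losses.get)
print(best_label, best_loss)
```

With the stand-in values shown, the label "dog" is selected because its loss value is the lowest of the three. Passing the loss evaluation as a callable keeps the selection step independent of how each loss value is obtained.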

What is claimed is:
 1. A method performed by a system of one or more computers for probabilistic inference in a model for use in machine learning, the method comprising: receiving data for training the model, the data comprising observed data for training and validating the model, and wherein the model is a modified restricted Boltzmann machine that includes interactions among hidden units of the restricted Boltzmann machine, wherein the interactions are based on hardware connections of a quantum oracle implemented using a quantum machine comprising an adiabatic quantum computing system, the hardware connections comprising couplers that connect qubits included in the quantum oracle; deriving input to the quantum oracle using the received data and a state of the model, the input mapping at least some interactions of different interconnected units of the model to connections between qubits in the quantum oracle; providing the input to the quantum oracle for learning the inference in the model; and receiving from the quantum oracle data representing the learned inference.
 2. The method of claim 1, wherein the model comprises hidden units and visible units, and one or more hidden units interconnect with at least one visible unit.
 3. The method of claim 2, wherein the model is an undirected graphical model and the hidden units are latent variables of the model.
 4. A method performed by a system comprising a quantum oracle for learning an inference in a model for use in machine learning, the model comprising a modified restricted Boltzmann machine that includes at least some interactions among hidden units of the restricted Boltzmann machine, wherein the interactions are based on hardware connections of the quantum oracle, the quantum oracle is implemented using a quantum machine comprising an adiabatic quantum computing system, and the hardware connections comprise couplers that connect qubits included in the quantum oracle, the method comprising: receiving input that maps at least some interactions of different interconnected units of the model to connections between qubits in the quantum oracle, the input containing data representing at least one aspect of a loss function of the model; and performing quantum adiabatic annealing for a Hamiltonian built based on the received input to learn the inference in the model.
 5. A method performed by a system of one or more computers for probabilistic inference in a model for use in machine learning, the model comprising at least one hidden layer comprising at least two hidden units and at least one visible layer comprising one or more visible units, wherein the at least two hidden units interconnect with each other based on hardware connections of a quantum oracle implemented using a quantum machine comprising an adiabatic quantum computing system, the hardware connections comprising at least one coupler that connects two qubits included in the quantum oracle, and wherein each of the hidden units interconnects with at least one visible unit, and the model has a loss function, the method comprising: receiving data for training the model, the data comprising observed data for training and validating the model; deriving input to a computing system using the received data and a state of the model, the input including data characterizing interactions among the interconnected hidden units and containing data representing at least one aspect of the loss function of the model; providing the input to the computing system for probabilistic inference in the model, comprising providing the input to the quantum oracle for inferring the hidden layer; and receiving from the quantum oracle data representing the inferred hidden layer.
 6. The method of claim 5, wherein the computing system comprises the quantum oracle and the model is a restricted Boltzmann model modified to include the interconnected hidden units.
 7. The method of claim 5, wherein the loss function is $L_{T} = -\log \tilde{p}(\hat{x}, \hat{y}) = -F_{T}\left(\log p(\hat{x}, s, \hat{y})\right) + \log \sum_{x,y} \exp\left(F_{T}\left(\log p(x, s, y)\right)\right)$ where T represents temperature, x represents variables corresponding to those observed data for use as input data during the training of the model, y represents variables corresponding to those observed data used as labels during the training of the model, the function p represents a probability distribution, and the function F represents a tempered log-partition function.
 8. A method performed by a system comprising a quantum oracle implemented using a quantum machine comprising an adiabatic quantum computing system for learning an inference in a model for use in machine learning, the model comprising at least one hidden layer comprising at least two hidden units and at least one visible layer comprising one or more visible units, the at least two hidden units interconnecting with each other, and each of the hidden units interconnecting with at least one visible unit, the method comprising: receiving input that maps at least interactions among the interconnected hidden units to interactions of qubits in the quantum oracle, the input containing data representing at least one aspect of a loss function of the model; and performing quantum adiabatic annealing by the quantum oracle for a Hamiltonian built based on the received input to infer the hidden layer.
 9. A system for learning an inference in a model, the model comprising at least one hidden layer comprising at least two hidden units, wherein the at least two hidden units interconnect with each other, and the system comprising: one or more data storage devices for storing training data for training the model and for storing parameters of the model; a quantum oracle for inferring the hidden layer of the model, the quantum oracle implemented using a quantum machine comprising an adiabatic quantum computing system and comprising qubits, at least some of the qubits interacting with each other; and one or more computers configured to derive input and provide the derived input to the quantum oracle for learning the inference, the input being derived using the stored training data and a state of the model, the input mapping at least some interactions of different interconnected units to the interactions of the qubits in the quantum oracle, and the input containing data representing at least one aspect of a loss function of the model.
 10. The method of claim 1, wherein the hardware connections in the quantum oracle are defined by a Chimera graph.
 11. The method of claim 1, wherein learning the inference in the model comprises learning part of the inference in the model.
 12. The method of claim 4, wherein the hardware connections in the quantum oracle are defined by a Chimera graph.
 13. The method of claim 4, wherein learning the inference in the model comprises learning part of the inference in the model. 